Modeling Morphologically Rich Languages Using Split Words and Unstructured Dependencies

Authors

  • Deniz Yuret
  • Ergun Biçici
Abstract

We experiment with splitting words into their stem and suffix components for modeling morphologically rich languages. We show that using a morphological analyzer and disambiguator results in a significant perplexity reduction in Turkish. We present flexible n-gram models, FlexGrams, which assume that the n−1 tokens that determine the probability of a given token can be chosen anywhere in the sentence rather than only from the preceding n−1 positions. Our final model achieves a 27% perplexity reduction compared to the standard n-gram model.
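The idea of conditioning a token on a context chosen anywhere in the sentence can be illustrated with a toy bigram (n = 2) model. The following is a minimal sketch and not the authors' implementation: the function names, the `heads` link array, the add-alpha smoothing, and the hand-segmented tokens are all assumptions made for illustration, and the links are assumed to be given and acyclic.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start symbol

def context_pairs(tokens, heads=None):
    """Return (context, token) pairs for a bigram-style model.

    heads[i] is the index of the token that conditions token i, or None
    for the start symbol. With heads=None this falls back to the
    standard bigram choice: the immediately preceding token.
    """
    pairs = []
    for i, tok in enumerate(tokens):
        if heads is None:
            j = i - 1 if i > 0 else None
        else:
            j = heads[i]
        pairs.append((tokens[j] if j is not None else START, tok))
    return pairs

def train(pairs, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(token | context). Smoothing choice is an assumption."""
    pair_counts = Counter(pairs)
    ctx_counts = Counter(ctx for ctx, _ in pairs)

    def prob(ctx, tok):
        return (pair_counts[(ctx, tok)] + alpha) / (ctx_counts[ctx] + alpha * vocab_size)

    return prob

def perplexity(prob, pairs):
    """exp of the negative mean log-probability over the scored pairs."""
    log_sum = sum(math.log(prob(ctx, tok)) for ctx, tok in pairs)
    return math.exp(-log_sum / len(pairs))

# Hand-segmented toy tokens (stem, +suffix); not output of a real analyzer.
tokens = ["ev", "+ler", "kitap", "+lar"]
# Hypothetical links: the stem "kitap" is conditioned on the earlier stem "ev"
# instead of the adjacent suffix "+ler".
heads = [None, 0, 0, 2]

standard = context_pairs(tokens)
flexible = context_pairs(tokens, heads)
```

Here `standard` pairs each token with its left neighbour, while `flexible` lets a stem skip over the preceding suffix. Comparing `perplexity` for models trained on each kind of pair over the same corpus is one way to measure a relative reduction of the sort the abstract reports.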

Similar resources

Capturing Word-level Dependencies in Morpheme-based Language Modeling

Morphologically rich languages suffer from data sparsity and out-of-vocabulary words problems. As a result, researchers use morphemes (sub-words) as units in language modeling instead of full-word forms. The use of morphemes in language modeling, however, might lead to a loss of word level dependency since a word can be segmented into 3 or more morphemes and the scope of the morpheme n-gram mig...

Providing Morphological Information for SMT Using Neural Networks

Treating morphologically complex words (MCWs) as atomic units in translation would not yield a desirable result. Such words are complicated constituents with meaningful subunits. A complex word in a morphologically rich language (MRL) could be associated with a number of words or even a full sentence in a simpler language, which means the surface form of complex words should be accompanied with...

Improved Transition-based Parsing by Modeling Characters instead of Words with LSTMs

We present extensions to a continuous-state dependency parsing method that makes it applicable to morphologically rich languages. Starting with a high-performance transition-based parser that uses long short-term memory (LSTM) recurrent neural networks to learn representations of the parser state, we replace lookup-based word representations with representations constructed based on the orthograp...

Improving the Performance of Neural Machine Translation Involving Morphologically Rich Languages

The advent of the attention mechanism in neural machine translation models has improved the performance of machine translation systems by enabling selective lookup into the source sentence. In this paper, the efficiencies of translation using bidirectional encoder attention decoder models were studied with respect to translation involving morphologically rich languages. The English–Tamil langua...

Toward Never Ending Language Learning for Morphologically Rich Languages

This work deals with ontology learning from unstructured Russian text. We implement one of the components of Never Ending Language Learner and introduce the algorithm extensions aimed to gather specificity of morphologically rich free-word-order language. We perform several experiments comparing different settings of the training process. We demonstrate that morphological features significantly ...


Publication date: 2009